{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Datawhale 智慧海洋建设-Task3 特征工程" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "此部分为智慧海洋建设竞赛的特征工程模块,通过特征工程,可以最大限度地从原始数据中提取特征以供算法和模型使用。通俗而言,就是通过X,创造新的X'以获得更好的训练、预测效果。\n", "\n", "“数据和特征决定了机器学习的上限,而模型和算法只是逼近这个上限而已”——机器学习界;\n", "\n", "类似的,吴恩达曾说过:“特征工程不仅操作困难、耗时,而且需要专业领域知识。应用机器学习基本上就是特征工程。”\n", "\n", "\n", "赛题:智慧海洋建设\n", "\n", "特征工程的目的:\n", "\n", "- 特征工程是一个包含内容很多的主题,也被认为是成功应用机器学习的一个很重要的环节。如何充分利用数据进行预测建模就是特征工程要解决的问题! “实际上,所有机器学习算法的成功取决于如何呈现数据。” “特征工程是一个看起来不值得在任何论文或者书籍中被探讨的一个主题。但是他却对机器学习的成功与否起着至关重要的作用。机器学习算法很多都是由于建立一个学习器能够理解的工程化特征而获得成功的。”——ScottLocklin,in “Neglected machine learning ideas”\n", "\n", "\n", "- 数据中的特征对预测的模型和获得的结果有着直接的影响。可以这样认为,特征选择和准备越好,获得的结果也就越好。这是正确的,但也存在误导。预测的结果其实取决于许多相关的属性:比如说能获得的数据、准备好的特征以及模型的选择。\n", "\n", "\n", "- 上分!:) 毫不夸张的说在基本的数据挖掘类比赛中,特征工程就是你和topline的距离。\n", "\n", "项目地址:https://github.com/datawhalechina/team-learning-data-mining/tree/master/wisdomOcean\n", "\n", "\n", "比赛地址:https://tianchi.aliyun.com/competition/entrance/231768/introduction?spm=5176.12281957.1004.8.4ac63eafE1rwsY" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 学习目标" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. 学习特征工程的基本概念\n", "\n", "\n", "2. 学习topline代码的特征工程构造方法,实现构建有意义的特征工程\n", "\n", "\n", "3. 完成相应学习打卡任务" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 内容介绍" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "0. 特征工程概述\n", "\n", "1. 赛题特征工程\n", " - 业务特征,根据先验知识进行专业性的特征构建\n", "2. 分箱特征\n", " - v、x、y的分箱特征\n", " - x、y分箱后并构造区域\n", "3. DataFramte特征\n", " - count计数值\n", " - shift偏移量\n", " - 统计特征\n", "4. Embedding特征\n", " - Word2vec构造词向量\n", " - NMF提取文本的主题分布\n", "5. 总结" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 特征工程概述" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "特征工程大体可分为3部分,特征构建、特征提取和特征选择。\n", "\n", "- 特征构建\n", "\n", "“从数学的角度讲,特征工程就是将原始数据空间变换到新的特征空间,或者说是换一种数据的表达方式,在新的特征空间中,模型能够更好地学习数据中的规律。因此,特征抽取就是对原始数据进行变换的过程。大多数模型和算法都要求输入是维度相同的实向量,因此特征工程首先需要将原始数据转化为实向量。”\n", "其主要包含内容有:\n", "\n", " + 探索性数据分析\n", " + 数值特征\n", " + 类别特征\n", " + 时间特征\n", " + 文本特征\n", "\n", "- 特征提取和特征选择\n", "\n", "特征提取和特征选择概念上来说很像,其实特征提取指的是通过特征转换得到一组具有明显物理或统计意义的特征。而特征选择就是在特征集里直接挑出具有明显物理或统计意义的特征。\n", "\n", "与特征提取是从原始数据中构造新的特征不同,特征选择是从这些特征集合中选出一个子集。特征选择对于机器学习应用来说非常重要。特征选择也称为属性选择或变量选择,是指为了构建模型而选择相关特征子集的过程。特征选择的目的有如下三个。\n", "\n", " + 简化模型,使模型更易于研究人员和用户理解。可解释性不仅让我们对模型效果的稳定性有更多的把握,而且也能为业务运营等工作提供指引和决策支持。\n", "\n", " + 改善性能。特征选择的另一个作用是节省存储和计算开销。\n", "\n", " + 改善通用性、降低过拟合风险。特征的增多会大大增加模型的搜索空间,大多数模型所需要的训练样本数目随着特征数量的增加而显著增加,特征的增加虽然能更好地拟合训练数据,但也可能增加方差。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "————————————————————————————————————————————————————————————————————" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "注:本ipynb着重学习topline代码的特征工程构造方法,效果需要模型方面进行预测打分" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "————————————————————————————————————————————————————————————————————" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "导入所需库和数据\n", "\n", "补充:\n", "下述库中的geopandas安装可能会遇到问题,可通过如下博客解决:\n", "\n", "https://qianni1997.github.io/2019/07/26/geopandas-install/" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:44.860521Z", "start_time": "2021-04-06T09:40:29.681465Z" } }, "outputs": [], "source": [ "import gc\n", "import multiprocessing as mp\n", "import os\n", "import pickle\n", "import time\n", "import warnings\n", "from collections import Counter\n", "from copy import 
deepcopy\n", "from datetime import datetime\n", "from functools import partial\n", "from glob import glob\n", "\n", "import geopandas as gpd\n", "import lightgbm as lgb\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "from gensim.models import FastText, Word2Vec\n", "from gensim.models.doc2vec import Doc2Vec, TaggedDocument\n", "from pyproj import Proj\n", "from scipy import sparse\n", "from scipy.sparse import csr_matrix\n", "from sklearn import metrics\n", "from sklearn.cluster import DBSCAN\n", "from sklearn.decomposition import NMF, TruncatedSVD\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "from sklearn.metrics import f1_score, precision_recall_fscore_support\n", "from sklearn.model_selection import StratifiedKFold\n", "from sklearn.preprocessing import LabelEncoder\n", "from tqdm import tqdm\n", "\n", "os.environ['PYTHONHASHSEED'] = '0'\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:45.155446Z", "start_time": "2021-04-06T09:40:44.861521Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0%| | 0/7000 [00:00 max_lines:break\n", " \n", " p = paths[t]\n", " with open('{}/{}'.format(file_path, p), encoding='utf-8') as f:\n", " next(f)\n", " for line in f.readlines():\n", " tmp.append(line.strip().split(','))\n", " if len(tmp) > max_lines:break\n", " \n", " tmp_df = pd.DataFrame(tmp)\n", " tmp_df.columns = ['渔船ID', 'x', 'y', '速度', '方向', 'time', 'type']\n", " return tmp_df\n", "\n", "TRAIN_PATH = \"../input/hy_round1_train_20200102/\"\n", "# 采样数据行数\n", "max_lines = 2000\n", "df = get_data(TRAIN_PATH,max_lines=max_lines)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:45.217623Z", "start_time": "2021-04-06T09:40:45.157392Z" }, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idxyvdirtimelabeldatehourmonthweekday
006.152038e+065.124873e+062.591021900-11-10 11:58:1901900-11-1011115
106.151230e+065.125218e+062.701131900-11-10 11:48:1901900-11-1011115
206.150421e+065.125563e+062.701161900-11-10 11:38:1901900-11-1011115
306.149612e+065.125907e+063.29951900-11-10 11:28:1901900-11-1011115
406.148803e+065.126252e+063.181081900-11-10 11:18:1901900-11-1011115
\n", "
" ], "text/plain": [ " id x y v dir time label \\\n", "0 0 6.152038e+06 5.124873e+06 2.59 102 1900-11-10 11:58:19 0 \n", "1 0 6.151230e+06 5.125218e+06 2.70 113 1900-11-10 11:48:19 0 \n", "2 0 6.150421e+06 5.125563e+06 2.70 116 1900-11-10 11:38:19 0 \n", "3 0 6.149612e+06 5.125907e+06 3.29 95 1900-11-10 11:28:19 0 \n", "4 0 6.148803e+06 5.126252e+06 3.18 108 1900-11-10 11:18:19 0 \n", "\n", " date hour month weekday \n", "0 1900-11-10 11 11 5 \n", "1 1900-11-10 11 11 5 \n", "2 1900-11-10 11 11 5 \n", "3 1900-11-10 11 11 5 \n", "4 1900-11-10 11 11 5 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 基本预处理\n", "label_dict1 = {'拖网': 0, '围网': 1, '刺网': 2}\n", "label_dict2 = {0: '拖网', 1: '围网', 2: '刺网'}\n", "name_dict = {'渔船ID': 'id', '速度': 'v', '方向': 'dir', 'type': 'label'}\n", "\n", "df.rename(columns = name_dict, inplace = True)\n", "df['label'] = df['label'].map(label_dict1)\n", "cols = ['x','y','v']\n", "for col in cols:\n", " df[col] = df[col].astype('float')\n", "df['dir'] = df['dir'].astype('int')\n", "df['time'] = pd.to_datetime(df['time'], format='%m%d %H:%M:%S')\n", "df['date'] = df['time'].dt.date\n", "df['hour'] = df['time'].dt.hour\n", "df['month'] = df['time'].dt.month\n", "df['weekday'] = df['time'].dt.weekday\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "数据说明:\n", "\n", " - id:渔船ID,整数\n", " - x:记录位置横坐标,浮点数\n", " - y:记录位置纵坐标,浮点数\n", " - v:记录速度,浮点数\n", " - dir:记录航向,整数\n", " - time:时间,文本\n", " - label:需要预测的标签,整数" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 赛题特征工程" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 构造各点的(x、y)坐标与特定点(6165599,5202660)的距离" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:51.254522Z", "start_time": "2021-04-06T09:40:51.223636Z" } }, "outputs": [ { "data": { "text/plain": [ "0 78959.780945\n", "1 78763.845006\n", "2 78577.185266\n", "3 78399.867568\n", "4 78231.955018\n", "Name: base_dis_diff, dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['x_dis_diff'] = (df['x'] - 6165599).abs()\n", "df['y_dis_diff'] = (df['y'] - 5202660).abs()\n", "df['base_dis_diff'] = ((df['x_dis_diff']**2)+(df['y_dis_diff']**2))**0.5 \n", "del df['x_dis_diff'],df['y_dis_diff'] \n", "df['base_dis_diff'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 对时间,小时进行白天、黑天进行划分,5-20为白天1,其余为黑天0" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:52.721776Z", "start_time": "2021-04-06T09:40:52.696829Z" } }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 1\n", "3 1\n", "4 1\n", "Name: day_nig, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['day_nig'] = 0\n", "df.loc[(df['hour'] > 5) & (df['hour'] < 20),'day_nig'] = 1\n", "df['day_nig'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 根据月份划分季度" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:54.053897Z", "start_time": "2021-04-06T09:40:54.030942Z" } }, "outputs": [], "source": [ "# 季度\n", "df['quarter'] = 0\n", "df.loc[(df['month'].isin([1, 2, 3])), 'quarter'] = 1\n", "df.loc[(df['month'].isin([4, 5, 6, ])), 'quarter'] = 2\n", "df.loc[(df['month'].isin([7, 8, 9])), 'quarter'] = 3\n", "df.loc[(df['month'].isin([10, 11, 12])), 'quarter'] = 4" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "## 动态速度,速度变化,角度变化,xy相似性等特征" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:55.098791Z", "start_time": "2021-04-06T09:40:55.062887Z" } }, "outputs": [], "source": [ "temp = df.copy()\n", "temp.rename(columns={'id':'ship','dir':'d'},inplace=True)\n", "\n", "# 给速度一个等级\n", "def v_cut(v):\n", " if v < 0.1:\n", " return 0\n", " elif v < 0.5:\n", " return 1\n", " elif v < 1:\n", " return 2\n", " elif v < 2.5:\n", " return 3\n", " elif v < 5:\n", " return 4\n", " elif v < 10:\n", " return 5\n", " elif v < 20:\n", " return 5\n", " else:\n", " return 6\n", "# 统计每个ship的对应速度等级的个数\n", "def get_v_fea(df):\n", "\n", " df['v_cut'] = df['v'].apply(lambda x: v_cut(x))\n", " tmp = df.groupby(['ship', 'v_cut'], as_index=False)['v_cut'].agg({'v_cut_count': 'count'})\n", " # 通过pivot构建透视表\n", " tmp = tmp.pivot(index='ship', columns='v_cut', values='v_cut_count')\n", "\n", " new_col_nm = ['v_cut_' + str(col) for col in tmp.columns.tolist()]\n", " tmp.columns = new_col_nm\n", " tmp = tmp.reset_index() # 把index恢复成data\n", "\n", " return tmp\n", "\n", "c1 = get_v_fea(temp)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:56.796042Z", "start_time": "2021-04-06T09:40:56.769114Z" } }, "outputs": [], "source": [ "# 方位进行16均分\n", "def add_direction(df):\n", " df['d16'] = df['d'].apply(lambda x: int((x / 22.5) + 0.5) % 16 if not np.isnan(x) else np.nan)\n", " return df\n", "def get_d_cut_count_fea(df):\n", " df = add_direction(df)\n", " tmp = df.groupby(['ship', 'd16'], as_index=False)['d16'].agg({'d16_count': 'count'})\n", " tmp = tmp.pivot(index='ship', columns='d16', values='d16_count')\n", " new_col_nm = ['d16_' + str(col) for col in tmp.columns.tolist()]\n", " tmp.columns = new_col_nm\n", " tmp = tmp.reset_index()\n", " return tmp\n", "\n", "c2 = get_d_cut_count_fea(temp)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:57.574641Z", "start_time": "2021-04-06T09:40:57.539739Z" } }, "outputs": [], "source": [ "def get_v0_fea(df):\n", " # 统计速度为0的个数,以及速度不为0的统计量\n", " df_zero_count = df.query(\"v==0\")[['ship', 'v']].groupby('ship', as_index=False)['v'].agg(\n", " {'num_zero_v': 'count'})\n", " df_not_zero_agg = df.query(\"v!=0\")[['ship', 'v']].groupby('ship', as_index=False)['v'].agg(\n", " {'v_max_drop_0': 'max',\n", " 'v_min_drop_0': 'min',\n", " 'v_mean_drop_0': 'mean',\n", " 'v_std_drop_0': 'std',\n", " 'v_median_drop_0': 'median',\n", " 'v_skew_drop_0': 'skew'})\n", " tmp = df_zero_count.merge(df_not_zero_agg, on='ship', how='left')\n", "\n", " return tmp\n", "\n", "c3 = get_v0_fea(temp)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:58.057987Z", "start_time": "2021-04-06T09:40:57.967114Z" } }, "outputs": [], "source": [ "def get_percentiles_fea(df_raw):\n", " key = ['x', 'y', 'v', 'd']\n", " temp = df_raw[['ship']].drop_duplicates('ship')\n", " for i in range(len(key)):\n", " # 加入x,v,d,y的中位数和各种位数\n", " tmp_dscb = df_raw.groupby('ship')[key[i]].describe(\n", " percentiles=[0.05] + [ii / 1000 for ii in range(125, 1000, 125)] + [0.95])\n", " raw_col_nm = tmp_dscb.columns.tolist()\n", " new_col_nm = [key[i] + '_' + col for col in raw_col_nm]\n", " tmp_dscb.columns = new_col_nm\n", " tmp_dscb = tmp_dscb.reset_index()\n", " # 删掉多余的统计特征\n", " tmp_dscb = tmp_dscb.drop([f'{key[i]}_count', f'{key[i]}_mean', f'{key[i]}_std',\n", " 
{ "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:56.796042Z", "start_time": "2021-04-06T09:40:56.769114Z" } }, "outputs": [], "source": [ "# Quantize the heading into 16 sectors of 22.5 degrees each\n", "def add_direction(df):\n", "    df['d16'] = df['d'].apply(lambda x: int((x / 22.5) + 0.5) % 16 if not np.isnan(x) else np.nan)\n", "    return df\n", "def get_d_cut_count_fea(df):\n", "    df = add_direction(df)\n", "    tmp = df.groupby(['ship', 'd16'], as_index=False)['d16'].agg({'d16_count': 'count'})\n", "    tmp = tmp.pivot(index='ship', columns='d16', values='d16_count')\n", "    new_col_nm = ['d16_' + str(col) for col in tmp.columns.tolist()]\n", "    tmp.columns = new_col_nm\n", "    tmp = tmp.reset_index()\n", "    return tmp\n", "\n", "c2 = get_d_cut_count_fea(temp)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:57.574641Z", "start_time": "2021-04-06T09:40:57.539739Z" } }, "outputs": [], "source": [ "def get_v0_fea(df):\n", "    # count zero-speed records, and compute statistics over the non-zero speeds\n", "    df_zero_count = df.query(\"v==0\")[['ship', 'v']].groupby('ship', as_index=False)['v'].agg(\n", "        {'num_zero_v': 'count'})\n", "    df_not_zero_agg = df.query(\"v!=0\")[['ship', 'v']].groupby('ship', as_index=False)['v'].agg(\n", "        {'v_max_drop_0': 'max',\n", "         'v_min_drop_0': 'min',\n", "         'v_mean_drop_0': 'mean',\n", "         'v_std_drop_0': 'std',\n", "         'v_median_drop_0': 'median',\n", "         'v_skew_drop_0': 'skew'})\n", "    tmp = df_zero_count.merge(df_not_zero_agg, on='ship', how='left')\n", "\n", "    return tmp\n", "\n", "c3 = get_v0_fea(temp)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:58.057987Z", "start_time": "2021-04-06T09:40:57.967114Z" } }, "outputs": [], "source": [ "def get_percentiles_fea(df_raw):\n", "    key = ['x', 'y', 'v', 'd']\n", "    temp = df_raw[['ship']].drop_duplicates('ship')\n", "    for i in range(len(key)):\n", "        # add the median and a spread of percentiles for x, y, v and d\n", "        tmp_dscb = df_raw.groupby('ship')[key[i]].describe(\n", "            percentiles=[0.05] + [ii / 1000 for ii in range(125, 1000, 125)] + [0.95])\n", "        raw_col_nm = tmp_dscb.columns.tolist()\n", "        new_col_nm = [key[i] + '_' + col for col in raw_col_nm]\n", "        tmp_dscb.columns = new_col_nm\n", "        tmp_dscb = tmp_dscb.reset_index()\n", "        # drop the redundant describe() statistics\n", "        tmp_dscb = tmp_dscb.drop([f'{key[i]}_count', f'{key[i]}_mean', f'{key[i]}_std',\n", "                                  f'{key[i]}_min', f'{key[i]}_max'], axis=1)\n", "\n", "        temp = temp.merge(tmp_dscb, on='ship', how='left')\n", "    return temp\n", "\n", "c4 = get_percentiles_fea(temp)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:58.605497Z", "start_time": "2021-04-06T09:40:58.425813Z" } }, "outputs": [], "source": [ "def get_d_change_rate_fea(df):\n", "    import math\n", "    temp = df.copy()\n", "    # sort by ship and time\n", "    temp.sort_values(['ship', 'time'], ascending=True, inplace=True)\n", "    # shift(-1) within each ship aligns every row with its next record\n", "    temp['timenext'] = temp.groupby('ship')['time'].shift(-1)\n", "    temp['ynext'] = temp.groupby('ship')['y'].shift(-1)\n", "    temp['xnext'] = temp.groupby('ship')['x'].shift(-1)\n", "    # shift(-1) leaves NaN in each ship's last row (there is no next record);\n", "    # forward-fill is a simple patch for those rows\n", "    temp['ynext'] = temp['ynext'].fillna(method='ffill')\n", "    temp['xnext'] = temp['xnext'].fillna(method='ffill')\n", "    # heading of the segment from (x, y) to (xnext, ynext),\n", "    # as the arctangent of dy/dx in degrees from the horizontal\n", "    temp['angle_next'] = (temp['ynext'] - temp['y']) / (temp['xnext'] - temp['x'])\n", "    temp['angle_next'] = np.arctan(temp['angle_next']) / math.pi * 180\n", "    temp['angle_next_next'] = temp['angle_next'].shift(-1)\n", "    temp['timediff'] = np.abs(temp['timenext'] - temp['time'])\n", "    temp['timediff'] = temp['timediff'].fillna(method='ffill')\n", "    temp['hc_xy'] = abs(temp['angle_next_next'] - temp['angle_next'])\n", "    # fold heading changes larger than 180 degrees back to 360 minus the change\n", "    temp.loc[temp['hc_xy'] > 180, 'hc_xy'] = (360 - temp.loc[temp['hc_xy'] > 180, 'hc_xy'])\n", "    temp['hc_xy_s'] = temp.apply(lambda x: x['hc_xy'] / x['timediff'].total_seconds(), axis=1)\n", "\n", "    temp['d_next'] = temp.groupby('ship')['d'].shift(-1)\n", "    temp['hc_d'] = abs(temp['d_next'] - temp['d'])\n", "    temp.loc[temp['hc_d'] > 180, 'hc_d'] = 360 - temp.loc[temp['hc_d'] > 180, 'hc_d']\n", "    temp['hc_d_s'] = temp.apply(lambda x: x['hc_d'] / x['timediff'].total_seconds(), axis=1)\n", "\n", "    temp1 = temp[['ship', 'hc_xy_s', 'hc_d_s']]\n", "    xy_d_rate = temp1.groupby('ship')['hc_xy_s'].agg({'hc_xy_s_max': 'max',\n", "                                                      })\n", "    xy_d_rate = xy_d_rate.reset_index()\n", "    d_d_rate = temp1.groupby('ship')['hc_d_s'].agg({'hc_d_s_max': 'max',\n", "                                                    })\n", "    d_d_rate = d_d_rate.reset_index()\n", "\n", "    tmp = xy_d_rate.merge(d_d_rate, on='ship', how='left')\n", "    return tmp\n", "\n", "c5 = get_d_change_rate_fea(temp)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:40:59.036757Z", "start_time": "2021-04-06T09:40:58.989886Z" } }, "outputs": [], "source": [ "f1 = temp.merge(c1,on='ship',how='left')\n", "f1 = f1.merge(c2,on='ship',how='left')\n", "f1 = f1.merge(c3,on='ship',how='left')\n", "f1 = f1.merge(c4,on='ship',how='left')\n", "f1 = f1.merge(c5,on='ship',how='left')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Binning features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binning features for v, x and y" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:00.267094Z", "start_time": "2021-04-06T09:41:00.126455Z" } }, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
v_binx_bin1x_bin2x_bin1_countx_bin2_countx_bin1_id_nuniquex_bin2_id_nuniquey_bin1y_bin2y_bin1_count...y_bin1_id_nuniquey_bin2_id_nuniquex_y_bin1x_bin1_y_bin1_countx_y_bin2x_bin2_y_bin2_countx_y_maxy_x_maxx_y_miny_x_min
00.00615.01168220512.02...210103-115954.6751570.0000000.00000049790.106760
10.01615.028221512.02...1111030.0000000.00000053070.048324808.872353
20.02615.028221512.02...1121030.000000-808.87235354707.5120920.000000
31.03614.0277222512.02...1131180.0000000.00000052951.293120808.787673
42.04614.0277222512.02...1141180.000000-808.78767355461.6530280.000000
\n", "

5 rows × 21 columns

\n", "
" ], "text/plain": [ " v_bin x_bin1 x_bin2 x_bin1_count x_bin2_count x_bin1_id_nunique \\\n", "0 0.0 0 615.0 116 8 2 \n", "1 0.0 1 615.0 2 8 2 \n", "2 0.0 2 615.0 2 8 2 \n", "3 1.0 3 614.0 2 77 2 \n", "4 2.0 4 614.0 2 77 2 \n", "\n", " x_bin2_id_nunique y_bin1 y_bin2 y_bin1_count ... y_bin1_id_nunique \\\n", "0 2 0 512.0 2 ... 2 \n", "1 2 1 512.0 2 ... 1 \n", "2 2 1 512.0 2 ... 1 \n", "3 2 2 512.0 2 ... 1 \n", "4 2 2 512.0 2 ... 1 \n", "\n", " y_bin2_id_nunique x_y_bin1 x_bin1_y_bin1_count x_y_bin2 \\\n", "0 1 0 1 0 \n", "1 1 1 1 0 \n", "2 1 2 1 0 \n", "3 1 3 1 1 \n", "4 1 4 1 1 \n", "\n", " x_bin2_y_bin2_count x_y_max y_x_max x_y_min y_x_min \n", "0 3 -115954.675157 0.000000 0.000000 49790.106760 \n", "1 3 0.000000 0.000000 53070.048324 808.872353 \n", "2 3 0.000000 -808.872353 54707.512092 0.000000 \n", "3 8 0.000000 0.000000 52951.293120 808.787673 \n", "4 8 0.000000 -808.787673 55461.653028 0.000000 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pre_cols = df.columns\n", "\n", "df['v_bin'] = pd.qcut(df['v'], 200, duplicates='drop') # 速度进行 200分位数分箱\n", "df['v_bin'] = df['v_bin'].map(dict(zip(df['v_bin'].unique(), range(df['v_bin'].nunique())))) # 分箱后映射编码\n", "for f in ['x', 'y']:\n", " df[f + '_bin1'] = pd.qcut(df[f], 1000, duplicates='drop') # x,y位置分箱1000\n", " df[f + '_bin1'] = df[f + '_bin1'].map(dict(zip(df[f + '_bin1'].unique(), range(df[f + '_bin1'].nunique()))))#编码\n", " df[f + '_bin2'] = df[f] // 10000 # 取整操作\n", " df[f + '_bin1_count'] = df[f + '_bin1'].map(df[f + '_bin1'].value_counts()) #x,y不同分箱的数量映射\n", " df[f + '_bin2_count'] = df[f + '_bin2'].map(df[f + '_bin2'].value_counts()) #数量映射\n", " df[f + '_bin1_id_nunique'] = df.groupby(f + '_bin1')['id'].transform('nunique')#基于分箱1 id数量映射\n", " df[f + '_bin2_id_nunique'] = df.groupby(f + '_bin2')['id'].transform('nunique')#基于分箱2 id数量映射\n", "for i in [1, 2]:\n", " # 特征交叉x_bin1(2),y_bin1(2) 形成类别 统计每类数量映射到列 \n", " df['x_y_bin{}'.format(i)] = df['x_bin{}'.format(i)].astype('str') + '_' + df['y_bin{}'.format(i)].astype('str')\n", " df['x_y_bin{}'.format(i)] = df['x_y_bin{}'.format(i)].map(\n", " dict(zip(df['x_y_bin{}'.format(i)].unique(), range(df['x_y_bin{}'.format(i)].nunique())))\n", " )\n", " df['x_bin{}_y_bin{}_count'.format(i, i)] = df['x_y_bin{}'.format(i)].map(df['x_y_bin{}'.format(i)].value_counts())\n", "for stat in ['max', 'min']:\n", " # 统计x_bin1 y_bin1的最大最小值\n", " df['x_y_{}'.format(stat)] = df['y'] - df.groupby('x_bin1')['y'].transform(stat)\n", " df['y_x_{}'.format(stat)] = df['x'] - df.groupby('y_bin1')['x'].transform(stat)\n", "\n", "new_cols = [i for i in df.columns if i not in pre_cols]\n", "df[new_cols].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 将x、y进行分箱并构造区域" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:01.197017Z", "start_time": "2021-04-06T09:41:01.181086Z" }, "scrolled": true }, "outputs": [], "source": [ "def traj_to_bin(traj=None, x_min=12031967.16239096, x_max=14226964.881853,\n", " y_min=1623579.449434373, y_max=4689471.1780792,\n", " row_bins=4380, col_bins=3136):\n", "\n", " # Establish bins on x direction and y direction\n", " x_bins = np.linspace(x_min, x_max, endpoint=True, num=col_bins + 1)\n", " y_bins = np.linspace(y_min, y_max, endpoint=True, num=row_bins + 1)\n", "\n", " # Determine each x coordinate belong to which bin\n", " traj.sort_values(by='x', inplace=True)\n", " x_res = np.zeros((len(traj), ))\n", " j = 0\n", " for 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Bin x and y and build grid regions" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:01.197017Z", "start_time": "2021-04-06T09:41:01.181086Z" }, "scrolled": true }, "outputs": [], "source": [ "def traj_to_bin(traj=None, x_min=12031967.16239096, x_max=14226964.881853,\n", "                y_min=1623579.449434373, y_max=4689471.1780792,\n", "                row_bins=4380, col_bins=3136):\n", "\n", "    # Build the bin edges along the x and y directions\n", "    x_bins = np.linspace(x_min, x_max, endpoint=True, num=col_bins + 1)\n", "    y_bins = np.linspace(y_min, y_max, endpoint=True, num=row_bins + 1)\n", "\n", "    # Determine which bin each x coordinate belongs to\n", "    traj.sort_values(by='x', inplace=True)\n", "    x_res = np.zeros((len(traj), ))\n", "    j = 0\n", "    for i in range(1, col_bins + 1):\n", "        low, high = x_bins[i-1], x_bins[i]\n", "        while( j < len(traj)):\n", "            # low - 0.001 for numerical stability.\n", "            if (traj[\"x\"].iloc[j] <= high) & (traj[\"x\"].iloc[j] > low - 0.001):\n", "                x_res[j] = i\n", "                j += 1\n", "            else:\n", "                break\n", "    traj[\"x_grid\"] = x_res\n", "    traj[\"x_grid\"] = traj[\"x_grid\"].astype(int)\n", "    traj[\"x_grid\"] = traj[\"x_grid\"].apply(str)\n", "\n", "    # Determine which bin each y coordinate belongs to\n", "    traj.sort_values(by='y', inplace=True)\n", "    y_res = np.zeros((len(traj), ))\n", "    j = 0\n", "    for i in range(1, row_bins + 1):\n", "        low, high = y_bins[i-1], y_bins[i]\n", "        while( j < len(traj)):\n", "            # low - 0.001 for numerical stability.\n", "            if (traj[\"y\"].iloc[j] <= high) & (traj[\"y\"].iloc[j] > low - 0.001):\n", "                y_res[j] = i\n", "                j += 1\n", "            else:\n", "                break\n", "    traj[\"y_grid\"] = y_res\n", "    traj[\"y_grid\"] = traj[\"y_grid\"].astype(int)\n", "    traj[\"y_grid\"] = traj[\"y_grid\"].apply(str)\n", "\n", "    # Combine the two grid ids into a single bin label\n", "    traj[\"no_bin\"] = [i + \"_\" + j for i, j in zip(\n", "        traj[\"x_grid\"].values.tolist(), traj[\"y_grid\"].values.tolist())]\n", "    traj.sort_values(by='time', inplace=True)\n", "    return traj\n", "\n", "bin_size = 800\n", "col_bins = int((14226964.881853 - 12031967.16239096) / bin_size)\n", "row_bins = int((4689471.1780792 - 1623579.449434373) / bin_size)" ] },
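{ "cell_type": "markdown", "metadata": {}, "source": [ "Design note: traj_to_bin assigns grid ids with an explicit sweep over the sorted coordinates. `np.digitize` performs the same edge lookup vectorized, which is usually shorter and faster on large trajectories; a sketch (boundary handling at the exact bin edges may differ slightly from the loop above):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Vectorized alternative to the bin-assignment loops above (sketch).\n", "# np.digitize(v, edges) returns i such that edges[i-1] <= v < edges[i].\n", "def coords_to_grid(traj, x_bins, y_bins):\n", "    out = traj.copy()\n", "    out['x_grid'] = np.digitize(out['x'], x_bins).astype(str)\n", "    out['y_grid'] = np.digitize(out['y'], y_bins).astype(str)\n", "    out['no_bin'] = out['x_grid'] + '_' + out['y_grid']\n", "    return out" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:01.968441Z", "start_time": "2021-04-06T09:41:01.791913Z" } }, "outputs": [ { "data": { "text/html": [ "<div>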
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
x_gridy_gridno_bin
1606000_0
1605000_0
1604000_0
1603000_0
1602000_0
............
1988000_0
1987000_0
1986000_0
1985000_0
1984000_0
\n", "

2001 rows × 3 columns

\n", "
" ], "text/plain": [ " x_grid y_grid no_bin\n", "1606 0 0 0_0\n", "1605 0 0 0_0\n", "1604 0 0 0_0\n", "1603 0 0 0_0\n", "1602 0 0 0_0\n", "... ... ... ...\n", "1988 0 0 0_0\n", "1987 0 0 0_0\n", "1986 0 0 0_0\n", "1985 0 0 0_0\n", "1984 0 0 0_0\n", "\n", "[2001 rows x 3 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pre_cols = df.columns\n", "# 特征x_grid,y_grid,no_bin\n", "df = traj_to_bin(df)\n", "\n", "new_cols = [i for i in df.columns if i not in pre_cols]\n", "df[new_cols]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# DataFrame特征" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## count计数值" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:03.199290Z", "start_time": "2021-04-06T09:41:03.181338Z" } }, "outputs": [], "source": [ "def find_save_visit_count_table(traj_data_df=None, bin_to_coord_df=None):\n", " \"\"\"Find and save the visit frequency of each bin.\"\"\"\n", " visit_count_df = traj_data_df.groupby([\"no_bin\"]).count().reset_index()\n", " visit_count_df = visit_count_df[[\"no_bin\", \"x\"]]\n", " visit_count_df.rename({\"x\":\"visit_count\"}, axis=1, inplace=True)\n", " return visit_count_df\n", "\n", "def find_save_unique_visit_count_table(traj_data_df=None, bin_to_coord_df=None):\n", " \"\"\"Find and save the unique boat visit count of each bin.\"\"\"\n", " unique_boat_count_df = traj_data_df.groupby([\"no_bin\"])[\"id\"].nunique().reset_index()\n", " unique_boat_count_df.rename({\"id\":\"visit_boat_count\"}, axis=1, inplace=True)\n", "\n", " unique_boat_count_df_save = pd.merge(bin_to_coord_df, unique_boat_count_df,\n", " on=\"no_bin\", how=\"left\")\n", " return unique_boat_count_df\n", "\n", "traj_df = df[[\"id\",\"x\", \"y\",'time',\"no_bin\"]]\n", "bin_to_coord_df = traj_df.groupby([\"no_bin\"]).median().reset_index()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:03.714709Z", "start_time": "2021-04-06T09:41:03.668832Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
visit_countvisit_boat_count
020016
120016
220016
320016
420016
\n", "
" ], "text/plain": [ " visit_count visit_boat_count\n", "0 2001 6\n", "1 2001 6\n", "2 2001 6\n", "3 2001 6\n", "4 2001 6" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pre_cols = df.columns\n", "\n", "# DataFrame tmp for finding POIs\n", "visit_count_df = find_save_visit_count_table(\n", " traj_df, bin_to_coord_df)\n", "unique_boat_count_df = find_save_unique_visit_count_table(\n", " traj_df, bin_to_coord_df)\n", "\n", "# # 特征'visit_count','visit_boat_count'\n", "df = df.merge(visit_count_df,on='no_bin',how='left')\n", "df = df.merge(unique_boat_count_df,on='no_bin',how='left')\n", "\n", "new_cols = [i for i in df.columns if i not in pre_cols]\n", "df[new_cols].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## shift偏移量特征" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:04.554883Z", "start_time": "2021-04-06T09:41:04.503988Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
x_prev_diffx_next_diffx_prev_next_diffy_prev_diffy_next_diffy_prev_next_diffdist_move_prevdist_move_nextdist_move_prev_nextdist_move_prev_bin
0NaN-911.903731NaNNaN455.919062NaNNaN1019.524696NaNNaN
1911.903731-911.965576-1823.869307-455.919062455.831205911.7502671019.5246961019.5407302039.0654231.0
2911.965576-918.791508-1830.757085-455.83120520.360332476.1915381019.540730919.0170721891.6738311.0
3918.791508-597.354368-1516.145877-20.360332993.1313651013.491697919.0170721158.9400971823.6950782.0
4597.354368-910.468269-1507.822637-993.131365564.4350061557.5663701158.9400971071.2326282167.8427303.0
\n", "
" ], "text/plain": [ " x_prev_diff x_next_diff x_prev_next_diff y_prev_diff y_next_diff \\\n", "0 NaN -911.903731 NaN NaN 455.919062 \n", "1 911.903731 -911.965576 -1823.869307 -455.919062 455.831205 \n", "2 911.965576 -918.791508 -1830.757085 -455.831205 20.360332 \n", "3 918.791508 -597.354368 -1516.145877 -20.360332 993.131365 \n", "4 597.354368 -910.468269 -1507.822637 -993.131365 564.435006 \n", "\n", " y_prev_next_diff dist_move_prev dist_move_next dist_move_prev_next \\\n", "0 NaN NaN 1019.524696 NaN \n", "1 911.750267 1019.524696 1019.540730 2039.065423 \n", "2 476.191538 1019.540730 919.017072 1891.673831 \n", "3 1013.491697 919.017072 1158.940097 1823.695078 \n", "4 1557.566370 1158.940097 1071.232628 2167.842730 \n", "\n", " dist_move_prev_bin \n", "0 NaN \n", "1 1.0 \n", "2 1.0 \n", "3 2.0 \n", "4 3.0 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pre_cols = df.columns\n", "\n", "g = df.groupby('id')\n", "for f in ['x', 'y']:\n", " #对x,y坐标进行时间平移 1 -1 2\n", " df[f + '_prev_diff'] = df[f] - g[f].shift(1)\n", " df[f + '_next_diff'] = df[f] - g[f].shift(-1)\n", " df[f + '_prev_next_diff'] = g[f].shift(1) - g[f].shift(-1)\n", " ## 三角形求解上时刻1距离 下时刻-1距离 2距离 \n", "df['dist_move_prev'] = np.sqrt(np.square(df['x_prev_diff']) + np.square(df['y_prev_diff']))\n", "df['dist_move_next'] = np.sqrt(np.square(df['x_next_diff']) + np.square(df['y_next_diff']))\n", "df['dist_move_prev_next'] = np.sqrt(np.square(df['x_prev_next_diff']) + np.square(df['y_prev_next_diff']))\n", "df['dist_move_prev_bin'] = pd.qcut(df['dist_move_prev'], 50, duplicates='drop')# 2时刻距离等频分箱50\n", "df['dist_move_prev_bin'] = df['dist_move_prev_bin'].map(\n", " dict(zip(df['dist_move_prev_bin'].unique(), range(df['dist_move_prev_bin'].nunique())))\n", ") #上一时刻映射编码\n", "\n", "new_cols = [i for i in df.columns if i not in pre_cols]\n", "df[new_cols].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 统计特征" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 基本统计特征用法" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "补充:\n", "\n", "分组统计特征agg的使用非常重要,在此进行代码示例,详细请参考:\n", "http://joyfulpandas.datawhale.club/Content/ch4.html\n", "\n", "- 请注意{}和[]的使用\n", "\n", "分组标准格式:\n", "\n", "df.groupby(分组依据)[数据来源].使用操作\n", "\n", "先分组,得到\n", "\n", "gb = df.groupby(['School', 'Grade'])\n", "\n", "- 【a】使用多个函数\n", "\n", "gb.agg(['具体方法(如内置函数)'])\n", "\n", "如gb.agg(['sum'])\n", "\n", "\n", "- 【b】对特定的列使用特定的聚合函数\n", "\n", "gb.agg({'指定列':'具体方法'})\n", "\n", "如gb.agg({'Height':['mean','max'], 'Weight':'count'})\n", "\n", "- 【c】使用自定义函数\n", "\n", "gb.agg(函数名或匿名函数)\n", "\n", "如gb.agg(lambda x: x.mean()-x.min())\n", "\n", "- 【d】聚合结果重命名\n", "\n", "gb.agg([\n", " ('重命名的名字',具体方法(如内置函数、自定义函数))\n", "])\n", "\n", "如gb.agg([('range', lambda x: x.max()-x.min()), ('my_sum', 'sum')])\n", "\n", "另外需要注意,使用对一个或者多个列使用单个聚合的时候,重命名需要加方括号,否则就不知道是新的名字还是手误输错的内置函数字符串:\n", "\n", "- 下述代码主要使用了\n", "\n", "一种是df.groupby('id').agg{'列名':'方法'},另一种是df.groupby('id')['列名'].agg(字典)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:08.013040Z", "start_time": "2021-04-06T09:41:07.908757Z" } }, "outputs": [], "source": [ "pre_cols = df.columns\n", "\n", "def start(x):\n", " try:\n", " return x[0]\n", " except:\n", " return None\n", "\n", "def end(x):\n", " try:\n", " return x[-1]\n", " except:\n", " return None\n", "\n", "\n", "def mode(x):\n", " try:\n", " return pd.Series(x).value_counts().index[0]\n", " except:\n", " return None\n", "\n", 
"for f in ['dist_move_prev_bin', 'v_bin']:\n", " # 上一时刻类别 速度类别映射处理\n", " df[f + '_sen'] = df['id'].map(df.groupby('id')[f].agg(lambda x: ','.join(x.astype(str))))\n", " \n", " # 一系列基本统计量特征 每列执行相应的操作\n", "g = df.groupby('id').agg({\n", " 'id': ['count'], 'x_bin1': [mode], 'y_bin1': [mode], 'x_bin2': [mode], 'y_bin2': [mode], 'x_y_bin1': [mode],\n", " 'x': ['mean', 'max', 'min', 'std', np.ptp, start, end],\n", " 'y': ['mean', 'max', 'min', 'std', np.ptp, start, end],\n", " 'v': ['mean', 'max', 'min', 'std', np.ptp], 'dir': ['mean'],\n", " 'x_bin1_count': ['mean'], 'y_bin1_count': ['mean', 'max', 'min'],\n", " 'x_bin2_count': ['mean', 'max', 'min'], 'y_bin2_count': ['mean', 'max', 'min'],\n", " 'x_bin1_y_bin1_count': ['mean', 'max', 'min'],\n", " 'dist_move_prev': ['mean', 'max', 'std', 'min', 'sum'],\n", " 'x_y_min': ['mean', 'min'], 'y_x_min': ['mean', 'min'],\n", " 'x_y_max': ['mean', 'min'], 'y_x_max': ['mean', 'min'],\n", "}).reset_index()\n", "g.columns = ['_'.join(col).strip() for col in g.columns] #提取列名\n", "g.rename(columns={'id_': 'id'}, inplace=True) #重命名id_\n", "cols = [f for f in g.keys() if f != 'id'] #特征列名提取" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:08.666832Z", "start_time": "2021-04-06T09:41:08.616927Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dist_move_prev_bin_senv_bin_senid_countx_bin1_modey_bin1_modex_bin2_modey_bin2_modex_y_bin1_modex_meanx_max...dist_move_prev_mindist_move_prev_sumx_y_min_meanx_y_min_miny_x_min_meany_x_min_minx_y_max_meanx_y_max_miny_x_max_meany_x_max_min
0nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...41114588611.0508.02526.123711e+066.151439e+06...0.0381420.8405542458.926640.04603.8144720.0-5075.500661-57432.286364-3493.862248-32066.348374
1nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...41114588611.0508.02526.123711e+066.151439e+06...0.0381420.8405542458.926640.04603.8144720.0-5075.500661-57432.286364-3493.862248-32066.348374
2nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...41114588611.0508.02526.123711e+066.151439e+06...0.0381420.8405542458.926640.04603.8144720.0-5075.500661-57432.286364-3493.862248-32066.348374
3nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...41114588611.0508.02526.123711e+066.151439e+06...0.0381420.8405542458.926640.04603.8144720.0-5075.500661-57432.286364-3493.862248-32066.348374
4nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...41114588611.0508.02526.123711e+066.151439e+06...0.0381420.8405542458.926640.04603.8144720.0-5075.500661-57432.286364-3493.862248-32066.348374
\n", "

5 rows × 54 columns

\n", "
" ], "text/plain": [ " dist_move_prev_bin_sen \\\n", "0 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... \n", "1 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... \n", "2 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... \n", "3 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... \n", "4 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... \n", "\n", " v_bin_sen id_count x_bin1_mode \\\n", "0 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 \n", "1 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 \n", "2 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 \n", "3 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 \n", "4 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 \n", "\n", " y_bin1_mode x_bin2_mode y_bin2_mode x_y_bin1_mode x_mean \\\n", "0 88 611.0 508.0 252 6.123711e+06 \n", "1 88 611.0 508.0 252 6.123711e+06 \n", "2 88 611.0 508.0 252 6.123711e+06 \n", "3 88 611.0 508.0 252 6.123711e+06 \n", "4 88 611.0 508.0 252 6.123711e+06 \n", "\n", " x_max ... dist_move_prev_min dist_move_prev_sum x_y_min_mean \\\n", "0 6.151439e+06 ... 0.0 381420.840554 2458.92664 \n", "1 6.151439e+06 ... 0.0 381420.840554 2458.92664 \n", "2 6.151439e+06 ... 0.0 381420.840554 2458.92664 \n", "3 6.151439e+06 ... 0.0 381420.840554 2458.92664 \n", "4 6.151439e+06 ... 0.0 381420.840554 2458.92664 \n", "\n", " x_y_min_min y_x_min_mean y_x_min_min x_y_max_mean x_y_max_min \\\n", "0 0.0 4603.814472 0.0 -5075.500661 -57432.286364 \n", "1 0.0 4603.814472 0.0 -5075.500661 -57432.286364 \n", "2 0.0 4603.814472 0.0 -5075.500661 -57432.286364 \n", "3 0.0 4603.814472 0.0 -5075.500661 -57432.286364 \n", "4 0.0 4603.814472 0.0 -5075.500661 -57432.286364 \n", "\n", " y_x_max_mean y_x_max_min \n", "0 -3493.862248 -32066.348374 \n", "1 -3493.862248 -32066.348374 \n", "2 -3493.862248 -32066.348374 \n", "3 -3493.862248 -32066.348374 \n", "4 -3493.862248 -32066.348374 \n", "\n", "[5 rows x 54 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.merge(g,on='id',how='left')\n", "\n", "new_cols = [i for i in df.columns if i not in pre_cols]\n", "df[new_cols].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 划分数据后进行统计" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:09.726927Z", "start_time": "2021-04-06T09:41:09.702958Z" } }, "outputs": [], "source": [ "def group_feature(df, key, target, aggs,flag): \n", " \"\"\"通过字典的形式来构建方法和重命名\"\"\"\n", " agg_dict = {}\n", " for ag in aggs:\n", " agg_dict['{}_{}_{}'.format(target,ag,flag)] = ag\n", "# print(agg_dict)\n", " t = df.groupby(key)[target].agg(agg_dict).reset_index()\n", " return t\n", "\n", "def extract_feature(df, train, flag):\n", " '''\n", " 统计feature\n", " 注意理解group_feature的使用和效果\n", " '''\n", " if (flag == 'on_night') or (flag == 'on_day'): \n", " t = group_feature(df, 'ship','speed',['max','mean','median','std','skew'],flag)\n", " train = pd.merge(train, t, on='ship', how='left')\n", " # return train\n", " \n", " \n", " if flag == \"0\":\n", " t = group_feature(df, 'ship','direction',['max','median','mean','std','skew'],flag)\n", " train = pd.merge(train, t, on='ship', how='left') \n", " elif flag == \"1\":\n", " t = group_feature(df, 'ship','speed',['max','mean','median','std','skew'],flag)\n", " train = pd.merge(train, t, on='ship', how='left')\n", " t = group_feature(df, 'ship','direction',['max','median','mean','std','skew'],flag)\n", " train = pd.merge(train, t, on='ship', how='left') \n", " # 
.nunique().to_dict() turns the per-ship nunique statistics into a plain dict\n", "        # to_dict() plus map() is a convenient way to build mapped statistics features, e.g. conversion rates in CTR-style classification problems\n", "        # Question: how would you build a train+test conversion-rate feature from the (0,1) labels given only on the training set? Note: test and train share some ids\n", "        nunique_map = df.groupby('ship')['speed'].nunique().to_dict()\n", "        train['speed_nunique_{}'.format(flag)] = train['ship'].map(nunique_map) \n", "        nunique_map = df.groupby('ship')['direction'].nunique().to_dict()\n", "        train['direction_nunique_{}'.format(flag)] = train['ship'].map(nunique_map) \n", "\n", "    t = group_feature(df, 'ship','x',['max','min','mean','median','std','skew'],flag)\n", "    train = pd.merge(train, t, on='ship', how='left')\n", "    t = group_feature(df, 'ship','y',['max','min','mean','median','std','skew'],flag)\n", "    train = pd.merge(train, t, on='ship', how='left')\n", "    t = group_feature(df, 'ship','base_dis_diff',['max','min','mean','std','skew'],flag)\n", "    train = pd.merge(train, t, on='ship', how='left')\n", "\n", "    \n", "    train['x_max_x_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]\n", "    train['y_max_y_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]\n", "    train['y_max_x_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]\n", "    train['x_max_y_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]\n", "    train['slope_{}'.format(flag)] = train['y_max_y_min_{}'.format(flag)] / np.where(train['x_max_x_min_{}'.format(flag)]==0, 0.001, train['x_max_x_min_{}'.format(flag)])\n", "    train['area_{}'.format(flag)] = train['x_max_x_min_{}'.format(flag)] * train['y_max_y_min_{}'.format(flag)] \n", "    \n", "    mode_hour = df.groupby('ship')['hour'].agg(lambda x:x.value_counts().index[0]).to_dict()\n", "    train['mode_hour_{}'.format(flag)] = train['ship'].map(mode_hour)\n", "    train['slope_median_{}'.format(flag)] = train['y_median_{}'.format(flag)] / np.where(train['x_median_{}'.format(flag)]==0, 0.001, train['x_median_{}'.format(flag)])\n", "\n", "    return train" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:11.295988Z", "start_time": "2021-04-06T09:41:10.995520Z" } }, "outputs": [], "source": [ "data = df.copy()\n", "data.rename(columns={\n", "    'id':'ship',\n", "    'v':'speed',\n", "    'dir':'direction'\n", "},inplace=True)\n", "# keep one row per ship as the feature table\n", "data_label = data.drop_duplicates(['ship'],keep = 'first')\n", "\n", "data_1 = data[data['speed']==0]\n", "data_2 = data[data['speed']!=0]\n", "data_label = extract_feature(data_1, data_label,\"0\")\n", "data_label = extract_feature(data_2, data_label,\"1\")\n", "\n", "data_1 = data[data['day_nig'] == 0]\n", "data_2 = data[data['day_nig'] == 1]\n", "data_label = extract_feature(data_1, data_label,\"on_night\")\n", "data_label = extract_feature(data_2, data_label,\"on_day\")\n", "data_label.rename(columns={'ship':'id','speed':'v','direction':'dir'},inplace=True)" ] },
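{ "cell_type": "markdown", "metadata": {}, "source": [ "One hedged sketch of an answer to the question raised in the comments above: learn the per-id mean of the 0/1 label on the training set only, then map it onto both train and test, falling back to the global training mean for unseen ids. `train_df` and `test_df` are placeholder names, and an out-of-fold variant would be safer against target leakage." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: id-level conversion rate (mean of a 0/1 label) learned on train only.\n", "def add_rate_feature(train_df, test_df, key='id', label='label'):\n", "    rate = train_df.groupby(key)[label].mean().to_dict()\n", "    default = train_df[label].mean()  # fallback for ids unseen in train\n", "    for frame in (train_df, test_df):\n", "        frame[key + '_rate'] = frame[key].map(rate).fillna(default)\n", "    return train_df, test_df\n", "# For stricter use, compute the rate out-of-fold on train to limit target leakage." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:11.527562Z", "start_time": "2021-04-06T09:41:11.473706Z" } }, "outputs": [ { "data": { "text/html": [ "<div>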
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
direction_max_0direction_median_0direction_mean_0direction_std_0direction_skew_0x_max_0x_min_0x_mean_0x_median_0x_std_0...base_dis_diff_std_on_daybase_dis_diff_skew_on_dayx_max_x_min_on_dayy_max_y_min_on_dayy_max_x_min_on_dayx_max_y_min_on_dayslope_on_dayarea_on_daymode_hour_on_dayslope_median_on_day
000.00.00.00.06.102751e+066.102751e+066.102751e+066.102751e+060.0...9650.263589-0.38959845396.66609243135.705758-989573.9820471.078106e+060.9501951.958217e+09190.831333
100.00.00.00.06.102751e+066.102751e+066.102751e+066.102751e+060.0...9650.263589-0.38959845396.66609243135.705758-989573.9820471.078106e+060.9501951.958217e+09190.831333
200.00.00.00.06.102751e+066.102751e+066.102751e+066.102751e+060.0...9650.263589-0.38959845396.66609243135.705758-989573.9820471.078106e+060.9501951.958217e+09190.831333
300.00.00.00.06.102751e+066.102751e+066.102751e+066.102751e+060.0...9650.263589-0.38959845396.66609243135.705758-989573.9820471.078106e+060.9501951.958217e+09190.831333
400.00.00.00.06.102751e+066.102751e+066.102751e+066.102751e+060.0...9650.263589-0.38959845396.66609243135.705758-989573.9820471.078106e+060.9501951.958217e+09190.831333
\n", "

5 rows × 127 columns

\n", "
" ], "text/plain": [ " direction_max_0 direction_median_0 direction_mean_0 direction_std_0 \\\n", "0 0 0.0 0.0 0.0 \n", "1 0 0.0 0.0 0.0 \n", "2 0 0.0 0.0 0.0 \n", "3 0 0.0 0.0 0.0 \n", "4 0 0.0 0.0 0.0 \n", "\n", " direction_skew_0 x_max_0 x_min_0 x_mean_0 x_median_0 \\\n", "0 0.0 6.102751e+06 6.102751e+06 6.102751e+06 6.102751e+06 \n", "1 0.0 6.102751e+06 6.102751e+06 6.102751e+06 6.102751e+06 \n", "2 0.0 6.102751e+06 6.102751e+06 6.102751e+06 6.102751e+06 \n", "3 0.0 6.102751e+06 6.102751e+06 6.102751e+06 6.102751e+06 \n", "4 0.0 6.102751e+06 6.102751e+06 6.102751e+06 6.102751e+06 \n", "\n", " x_std_0 ... base_dis_diff_std_on_day base_dis_diff_skew_on_day \\\n", "0 0.0 ... 9650.263589 -0.389598 \n", "1 0.0 ... 9650.263589 -0.389598 \n", "2 0.0 ... 9650.263589 -0.389598 \n", "3 0.0 ... 9650.263589 -0.389598 \n", "4 0.0 ... 9650.263589 -0.389598 \n", "\n", " x_max_x_min_on_day y_max_y_min_on_day y_max_x_min_on_day \\\n", "0 45396.666092 43135.705758 -989573.982047 \n", "1 45396.666092 43135.705758 -989573.982047 \n", "2 45396.666092 43135.705758 -989573.982047 \n", "3 45396.666092 43135.705758 -989573.982047 \n", "4 45396.666092 43135.705758 -989573.982047 \n", "\n", " x_max_y_min_on_day slope_on_day area_on_day mode_hour_on_day \\\n", "0 1.078106e+06 0.950195 1.958217e+09 19 \n", "1 1.078106e+06 0.950195 1.958217e+09 19 \n", "2 1.078106e+06 0.950195 1.958217e+09 19 \n", "3 1.078106e+06 0.950195 1.958217e+09 19 \n", "4 1.078106e+06 0.950195 1.958217e+09 19 \n", "\n", " slope_median_on_day \n", "0 0.831333 \n", "1 0.831333 \n", "2 0.831333 \n", "3 0.831333 \n", "4 0.831333 \n", "\n", "[5 rows x 127 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_cols = [i for i in data_label.columns if i not in df.columns]\n", "df = df.merge(data_label[new_cols+['id']],on='id',how='left')\n", "\n", "df[new_cols].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 统计特征的具体使用" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:13.059297Z", "start_time": "2021-04-06T09:41:12.464664Z" } }, "outputs": [], "source": [ "temp = df.copy()\n", "temp.rename(columns={'id':'ship','dir':'d'},inplace=True)\n", "\n", "def coefficient_of_variation(x):\n", " x = x.values\n", " if np.mean(x) == 0:\n", " return 0\n", " return np.std(x) / np.mean(x)\n", "\n", "def max_2(x):\n", " x = list(x.values)\n", " x.sort(reverse=True)\n", " return x[1]\n", "\n", "def max_3(x):\n", " x = list(x.values)\n", " x.sort(reverse=True)\n", " return x[2]\n", "\n", "def diff_abs_mean(x): # 统计特征 deta绝对值均值\n", " return np.mean(np.abs(np.diff(x)))\n", "\n", "f1 = pd.DataFrame()\n", "for col in ['x', 'y', 'v', 'd']:\n", " features = temp.groupby('ship', as_index=False)[col].agg({\n", " '{}_min'.format(col): 'min',\n", " '{}_max'.format(col): 'max',\n", " '{}_mean'.format(col): 'mean',\n", " '{}_median'.format(col): 'median',\n", " '{}_std'.format(col): 'std',\n", " '{}_skew'.format(col): 'skew',\n", " '{}_sum'.format(col): 'sum',\n", " '{}_diff_abs_mean'.format(col): diff_abs_mean,\n", " '{}_mode'.format(col): lambda x: x.value_counts().index[0],\n", " '{}_coefficient_of_variation'.format(col): coefficient_of_variation,\n", " '{}_max2'.format(col): max_2,\n", " '{}_max3'.format(col): max_3\n", " })\n", " if f1.shape[0] == 0:\n", " f1 = features\n", " else:\n", " f1 = f1.merge(features, on='ship', how='left')\n", "\n", "f1['x_max_x_min'] = f1['x_max'] - f1['x_min']\n", "f1['y_max_y_min'] = f1['y_max'] - 
f1['y_min']\n", "f1['y_max_x_min'] = f1['y_max'] - f1['x_min']\n", "f1['x_max_y_min'] = f1['x_max'] - f1['y_min']\n", "f1['slope'] = f1['y_max_y_min'] / np.where(f1['x_max_x_min'] == 0, 0.001, f1['x_max_x_min'])\n", "f1['area'] = f1['x_max_x_min'] * f1['y_max_y_min']\n", "f1['dis_max_min'] = (f1['x_max_x_min'] ** 2 + f1['y_max_y_min'] ** 2) ** 0.5\n", "f1['dis_mean'] = (f1['x_mean'] ** 2 + f1['y_mean'] ** 2) ** 0.5\n", "f1['area_d_dis_max_min'] = f1['area'] / f1['dis_max_min']\n", "\n", "# 加速度\n", "temp.sort_values(['ship', 'time'], ascending=True, inplace=True)\n", "temp['ynext'] = temp.groupby('ship')['y'].shift(-1)\n", "temp['xnext'] = temp.groupby('ship')['x'].shift(-1)\n", "temp['ynext'] = temp['ynext'].fillna(method='ffill')\n", "temp['xnext'] = temp['xnext'].fillna(method='ffill')\n", "temp['timenext'] = temp.groupby('ship')['time'].shift(-1)\n", "temp['timediff'] = np.abs(temp['timenext'] - temp['time'])\n", "temp['a_y'] = temp.apply(lambda x: (x['ynext'] - x['y']) / x['timediff'].total_seconds(), axis=1)\n", "temp['a_x'] = temp.apply(lambda x: (x['xnext'] - x['x']) / x['timediff'].total_seconds(), axis=1)\n", "for col in ['a_y', 'a_x']:\n", " f2 = temp.groupby('ship', as_index=False)[col].agg({\n", " '{}_max'.format(col): 'max',\n", " '{}_mean'.format(col): 'mean',\n", " '{}_min'.format(col): 'min',\n", " '{}_median'.format(col): 'median',\n", " '{}_std'.format(col): 'std'})\n", " f1 = f1.merge(f2, on='ship', how='left')\n", "\n", "# 曲率\n", "temp['y_pre'] = temp.groupby('ship')['y'].shift(1)\n", "temp['x_pre'] = temp.groupby('ship')['x'].shift(1)\n", "temp['y_pre'] = temp['y_pre'].fillna(method='bfill')\n", "temp['x_pre'] = temp['x_pre'].fillna(method='bfill')\n", "temp['d_pre'] = ((temp['x'] - temp['x_pre']) ** 2 + (temp['y'] - temp['y_pre']) ** 2) ** 0.5\n", "temp['d_next'] = ((temp['xnext'] - temp['x']) ** 2 + (temp['ynext'] - temp['y']) ** 2) ** 0.5\n", "temp['d_pre_next'] = ((temp['xnext'] - temp['x_pre']) ** 2 + (temp['ynext'] - temp['y_pre']) ** 2) ** 0.5\n", "temp['curvature'] = (temp['d_pre'] + temp['d_next']) / temp['d_pre_next']\n", "\n", "f2 = temp.groupby('ship', as_index=False)['curvature'].agg({\n", " 'curvature_max': 'max',\n", " 'curvature_mean': 'mean',\n", " 'curvature_min': 'min',\n", " 'curvature_median': 'median',\n", " 'curvature_std': 'std'})\n", "f1 = f1.merge(f2, on='ship', how='left')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# embedding特征" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Question!\n", "\n", "为什么在数据挖掘类比赛中,我们需要word2vec或NMF(方法有很多,但这两种常用)来构造 “词嵌入特征”?\n", "\n", "答: 上分!\n", "\n", "确实,上分是现象,但背后却是对整体数据的考虑,上述的统计特征、业务特征等也都是考虑了数据的整体性,但是却难免忽略了数据间的关系。举个例子,对于所有人的年龄特征,如果仅做一些统计特征如平均值、最值,业务特征如标准体重=体重/年龄等,这些都是人为理解的。那将这些特征想象成一个个词,并将所有数据(或同一组数据)的这些词组合当成一篇文章来考虑,是不是就可以得到一些额外的规律,即特征。\n", "\n", "- 简介\n", "\n", "所谓word embedding就是把一个词用编码的方式表示以便于feed到网络中。Word Embedding有的时候也被称作为分布式语义模型或向量空间模型等,所以从名字和其转换的方式我们就可以明白, Word Embedding技术可以将相同类型的词归到一起,例如苹果,芒果香蕉等,在投影之后的向量空间距离就会更近,而书本,房子这些则会与苹果这些词的距离相对较远。\n", "\n", "- 使用场景\n", "\n", "目前为止,Word Embedding可以用到特征生成,文件聚类,文本分类和自然语言处理等任务,例如:\n", "\n", "计算相似的词:Word Embedding可以被用来寻找与某个词相近的词。\n", "\n", "构建一群相关的词:对不同的词进行聚类,将相关的词聚集到一起;\n", "\n", "用于文本分类的特征:在文本分类问题中,因为词没法直接用于机器学习模型的训练,所以我们将词先投影到向量空间,这样之后便可以基于这些向量进行机器学习模型的训练;\n", "\n", "用于文件的聚类\n", "\n", "上面列举的是文本相关任务,当然目前词嵌入模型已经被扩展到方方面面。典型的,例如:\n", "\n", "在微博上面,每个人都用一个词来表示,对每个人构建Embedding,然后计算人之间的相关性,得到关系最为相近的人;\n", "\n", "在推荐问题里面,依据每个用户的购买的商品记录,对每个商品进行Embedding,就可以计算商品之间的相关性,并进行推荐;\n", "\n", 
"在此次天池的航海问题中,对相同经纬度上不同的船进行Embedding,就可以得到每个船只的向量,就可以得到经常在某些区域工作的船只;\n", "\n", "可以说,词嵌入为寻找物体之间相关性带来了巨大的帮助。现在基本每个数据竞赛都会见到Embedding技术。让我们来看看用的最多的Word2Vec模型。\n", "\n", "- Word2Vec在做什么?\n", "\n", "Word2vec在向量空间中对词进行表示, 或者说词以向量的形式表示,在词向量空间中:相似含义的单词一起出现,而不同的单词则位于很远的地方。这也被称为语义关系。\n", "\n", "神经网络不理解文本,而只理解数字。词嵌入提供了一种将文本转换为数字向量的方法。\n", "\n", "Word2vec就是在重建词的语言上下文。那什么是语言上下文?在一般的生活情景中,当我们通过说话或写作来交流,其他人会试图找出句子的目的。例如,“印度的温度是多少”,这里的上下文是用户想知道“印度的温度”即上下文。\n", "\n", "简而言之,句子的主要目标是语境。围绕口头或书面语言的单词或句子(披露)有助于确定上下文的意义。Word2vec通过上下文学习单词的矢量表示。\n", "\n", "- 参考文献\n", "\n", "[NLP] 秒懂词向量Word2vec的本质:https://zhuanlan.zhihu.com/p/26306795" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word2vec构造词向量" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:13.778719Z", "start_time": "2021-04-06T09:41:13.764759Z" } }, "outputs": [], "source": [ "def traj_cbow_embedding(traj_data_corpus=None, embedding_size=70,\n", " iters=40, min_count=3, window_size=25,\n", " seed=9012, num_runs=5, word_feat=\"no_bin\"):\n", " \"\"\"CBOW embedding for trajectory data.\"\"\"\n", " boat_id = traj_data_corpus['id'].unique()\n", " sentences, embedding_df_list, embedding_model_list = [], [], []\n", " for i in boat_id:\n", " traj = traj_data_corpus[traj_data_corpus['id']==i]\n", " sentences.append(traj[word_feat].values.tolist())\n", "\n", " print(\"\\n@Start CBOW word embedding at {}\".format(datetime.now()))\n", " print(\"-------------------------------------------\")\n", " for i in tqdm(range(num_runs)):\n", " model = Word2Vec(sentences, size=embedding_size,\n", " min_count=min_count,\n", " workers=mp.cpu_count(),\n", " window=window_size,\n", " seed=seed, iter=iters, sg=0)\n", "\n", " # Sentance vector\n", " embedding_vec = []\n", " for ind, seq in enumerate(sentences):\n", " seq_vec, word_count = 0, 0\n", " for word in seq:\n", " if word not in model:\n", " continue\n", " else:\n", " seq_vec += model[word]\n", " word_count += 1\n", " if word_count == 0:\n", " embedding_vec.append(embedding_size * [0])\n", " else:\n", " embedding_vec.append(seq_vec / word_count)\n", " embedding_vec = np.array(embedding_vec)\n", " embedding_cbow_df = pd.DataFrame(embedding_vec, \n", " columns=[\"embedding_cbow_{}_{}\".format(word_feat, i) for i in range(embedding_size)])\n", " embedding_cbow_df[\"id\"] = boat_id\n", " embedding_df_list.append(embedding_cbow_df)\n", " embedding_model_list.append(model)\n", " print(\"-------------------------------------------\")\n", " print(\"@End CBOW word embedding at {}\".format(datetime.now()))\n", " return embedding_df_list, embedding_model_list" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:14.390155Z", "start_time": "2021-04-06T09:41:14.128633Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", " 0%| | 0/1 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
embedding_cbow_no_bin_0embedding_cbow_no_bin_1embedding_cbow_no_bin_2embedding_cbow_no_bin_3embedding_cbow_no_bin_4embedding_cbow_no_bin_5embedding_cbow_no_bin_6embedding_cbow_no_bin_7embedding_cbow_no_bin_8embedding_cbow_no_bin_9...embedding_cbow_no_bin_60embedding_cbow_no_bin_61embedding_cbow_no_bin_62embedding_cbow_no_bin_63embedding_cbow_no_bin_64embedding_cbow_no_bin_65embedding_cbow_no_bin_66embedding_cbow_no_bin_67embedding_cbow_no_bin_68embedding_cbow_no_bin_69
00.1138760.9155070.748654-0.4747160.0259360.8917440.404129-0.733450.6645010.025082...-0.4608460.0965310.1069790.869454-0.4921840.166157-0.280037-0.351043-0.832541-0.139282
10.1138760.9155070.748654-0.4747160.0259360.8917440.404129-0.733450.6645010.025082...-0.4608460.0965310.1069790.869454-0.4921840.166157-0.280037-0.351043-0.832541-0.139282
20.1138760.9155070.748654-0.4747160.0259360.8917440.404129-0.733450.6645010.025082...-0.4608460.0965310.1069790.869454-0.4921840.166157-0.280037-0.351043-0.832541-0.139282
30.1138760.9155070.748654-0.4747160.0259360.8917440.404129-0.733450.6645010.025082...-0.4608460.0965310.1069790.869454-0.4921840.166157-0.280037-0.351043-0.832541-0.139282
40.1138760.9155070.748654-0.4747160.0259360.8917440.404129-0.733450.6645010.025082...-0.4608460.0965310.1069790.869454-0.4921840.166157-0.280037-0.351043-0.832541-0.139282
\n", "

5 rows × 70 columns

 "text/plain": [
"   embedding_cbow_no_bin_0  embedding_cbow_no_bin_1  embedding_cbow_no_bin_2  \\\n",
"0                 0.113876                 0.915507                 0.748654   \n",
"1                 0.113876                 0.915507                 0.748654   \n",
"2                 0.113876                 0.915507                 0.748654   \n",
"3                 0.113876                 0.915507                 0.748654   \n",
"4                 0.113876                 0.915507                 0.748654   \n",
"\n",
"   embedding_cbow_no_bin_3  embedding_cbow_no_bin_4  embedding_cbow_no_bin_5  \\\n",
"0                -0.474716                 0.025936                 0.891744   \n",
"1                -0.474716                 0.025936                 0.891744   \n",
"2                -0.474716                 0.025936                 0.891744   \n",
"3                -0.474716                 0.025936                 0.891744   \n",
"4                -0.474716                 0.025936                 0.891744   \n",
"\n",
"   embedding_cbow_no_bin_6  embedding_cbow_no_bin_7  embedding_cbow_no_bin_8  \\\n",
"0                 0.404129                 -0.73345                 0.664501   \n",
"1                 0.404129                 -0.73345                 0.664501   \n",
"2                 0.404129                 -0.73345                 0.664501   \n",
"3                 0.404129                 -0.73345                 0.664501   \n",
"4                 0.404129                 -0.73345                 0.664501   \n",
"\n",
"   embedding_cbow_no_bin_9  ...  embedding_cbow_no_bin_60  \\\n",
"0                 0.025082  ...                 -0.460846   \n",
"1                 0.025082  ...                 -0.460846   \n",
"2                 0.025082  ...                 -0.460846   \n",
"3                 0.025082  ...                 -0.460846   \n",
"4                 0.025082  ...                 -0.460846   \n",
"\n",
"   embedding_cbow_no_bin_61  embedding_cbow_no_bin_62  \\\n",
"0                  0.096531                  0.106979   \n",
"1                  0.096531                  0.106979   \n",
"2                  0.096531                  0.106979   \n",
"3                  0.096531                  0.106979   \n",
"4                  0.096531                  0.106979   \n",
"\n",
"   embedding_cbow_no_bin_63  embedding_cbow_no_bin_64  \\\n",
"0                  0.869454                 -0.492184   \n",
"1                  0.869454                 -0.492184   \n",
"2                  0.869454                 -0.492184   \n",
"3                  0.869454                 -0.492184   \n",
"4                  0.869454                 -0.492184   \n",
"\n",
"   embedding_cbow_no_bin_65  embedding_cbow_no_bin_66  \\\n",
"0                  0.166157                 -0.280037   \n",
"1                  0.166157                 -0.280037   \n",
"2                  0.166157                 -0.280037   \n",
"3                  0.166157                 -0.280037   \n",
"4                  0.166157                 -0.280037   \n",
"\n",
"   embedding_cbow_no_bin_67  embedding_cbow_no_bin_68  \\\n",
"0                 -0.351043                 -0.832541   \n",
"1                 -0.351043                 -0.832541   \n",
"2                 -0.351043                 -0.832541   \n",
"3                 -0.351043                 -0.832541   \n",
"4                 -0.351043                 -0.832541   \n",
"\n",
"   embedding_cbow_no_bin_69  \n",
"0                 -0.139282  \n",
"1                 -0.139282  \n",
"2                 -0.139282  \n",
"3                 -0.139282  \n",
"4                 -0.139282  \n",
"\n",
"[5 rows x 70 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [
"pre_cols = df.columns\n",
"df = df.merge(fea, on='id', how='left')\n",
"\n",
"new_cols = [i for i in df.columns if i not in pre_cols]\n",
"df[new_cols].head()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:15.479705Z", "start_time": "2021-04-06T09:41:15.037950Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 1/1 [00:00<00:00,  5.47it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "@Round 2 speed embedding:\n", "\n", "@Start CBOW word embedding at 2021-04-06 17:41:15.054905\n", "-------------------------------------------\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 1/1 [00:00<00:00,  5.44it/s]\n" ] }
 ], "source": [
"# Assumed reconstruction of this step: build CBOW sentence vectors for the\n",
"# discretized speed tokens (10 dims) and the joint speed+direction tokens\n",
"# (12 dims); the speed_str / speed_dir_str columns are assumed to come from\n",
"# the earlier binning section. The two frames are then joined on id.\n",
"total_embedding = None\n",
"for r, (word_feat, emb_size) in enumerate([('speed_str', 10),\n",
"                                           ('speed_dir_str', 12)], 1):\n",
"    print('\\n@Round {} speed embedding:'.format(r))\n",
"    emb_df_list, _ = traj_cbow_embedding(\n",
"        traj_data_corpus=df, embedding_size=emb_size, iters=40,\n",
"        min_count=3, window_size=25, seed=9012, num_runs=1,\n",
"        word_feat=word_feat)\n",
"    emb_df = emb_df_list[-1]\n",
"    total_embedding = (emb_df if total_embedding is None else\n",
"                       total_embedding.merge(emb_df, on='id', how='left'))" ] },
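{ "cell_type": "markdown", "metadata": {}, "source": [ "The 'words' fed to the speed embedding are just discretized speed values cast to strings. The cell below is a minimal sketch of how such a token column can be built; the column names and bin edges are hypothetical, and the real notebook constructs its token columns in the binning section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Hypothetical sketch: coarse string tokens from raw speeds, usable as\n",
"# Word2vec 'words'. The bin edges here are made up for illustration.\n",
"toy = pd.DataFrame({'id': [1, 1, 2, 2], 'v': [0.1, 3.2, 7.8, 8.0]})\n",
"toy['speed_str'] = pd.cut(toy['v'], bins=[-1, 1, 5, 20],\n",
"                          labels=['low', 'mid', 'high']).astype(str)\n",
"toy" ] },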
{ "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [
"   embedding_cbow_speed_str_0  embedding_cbow_speed_str_1  \\\n",
"0                   -1.751712                     0.83344   \n",
"1                   -1.751712                     0.83344   \n",
"2                   -1.751712                     0.83344   \n",
"3                   -1.751712                     0.83344   \n",
"4                   -1.751712                     0.83344   \n",
"\n",
"   embedding_cbow_speed_str_2  embedding_cbow_speed_str_3  \\\n",
"0                    1.175148                    2.350726   \n",
"1                    1.175148                    2.350726   \n",
"2                    1.175148                    2.350726   \n",
"3                    1.175148                    2.350726   \n",
"4                    1.175148                    2.350726   \n",
"\n",
"   embedding_cbow_speed_str_4  embedding_cbow_speed_str_5  \\\n",
"0                    0.081093                   -1.532153   \n",
"1                    0.081093                   -1.532153   \n",
"2                    0.081093                   -1.532153   \n",
"3                    0.081093                   -1.532153   \n",
"4                    0.081093                   -1.532153   \n",
"\n",
"   embedding_cbow_speed_str_6  embedding_cbow_speed_str_7  \\\n",
"0                    2.698867                    0.873376   \n",
"1                    2.698867                    0.873376   \n",
"2                    2.698867                    0.873376   \n",
"3                    2.698867                    0.873376   \n",
"4                    2.698867                    0.873376   \n",
"\n",
"   embedding_cbow_speed_str_8  embedding_cbow_speed_str_9  ...  \\\n",
"0                   -0.839753                   -0.537248  ...   \n",
"1                   -0.839753                   -0.537248  ...   \n",
"2                   -0.839753                   -0.537248  ...   \n",
"3                   -0.839753                   -0.537248  ...   \n",
"4                   -0.839753                   -0.537248  ...   \n",
"\n",
"   embedding_cbow_speed_dir_str_2  embedding_cbow_speed_dir_str_3  \\\n",
"0                        1.777333                        1.009888   \n",
"1                        1.777333                        1.009888   \n",
"2                        1.777333                        1.009888   \n",
"3                        1.777333                        1.009888   \n",
"4                        1.777333                        1.009888   \n",
"\n",
"   embedding_cbow_speed_dir_str_4  embedding_cbow_speed_dir_str_5  \\\n",
"0                        0.846912                        2.101565   \n",
"1                        0.846912                        2.101565   \n",
"2                        0.846912                        2.101565   \n",
"3                        0.846912                        2.101565   \n",
"4                        0.846912                        2.101565   \n",
"\n",
"   embedding_cbow_speed_dir_str_6  embedding_cbow_speed_dir_str_7  \\\n",
"0                        1.721207                        2.375947   \n",
"1                        1.721207                        2.375947   \n",
"2                        1.721207                        2.375947   \n",
"3                        1.721207                        2.375947   \n",
"4                        1.721207                        2.375947   \n",
"\n",
"   embedding_cbow_speed_dir_str_8  embedding_cbow_speed_dir_str_9  \\\n",
"0                        2.787326                        0.845491   \n",
"1                        2.787326                        0.845491   \n",
"2                        2.787326                        0.845491   \n",
"3                        2.787326                        0.845491   \n",
"4                        2.787326                        0.845491   \n",
"\n",
"   embedding_cbow_speed_dir_str_10  embedding_cbow_speed_dir_str_11  \n",
"0                        -2.064737                         1.990452  \n",
"1                        -2.064737                         1.990452  \n",
"2                        -2.064737                         1.990452  \n",
"3                        -2.064737                         1.990452  \n",
"4                        -2.064737                         1.990452  \n",
"\n",
"[5 rows x 22 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [
"pre_cols = df.columns\n",
"df = df.merge(total_embedding, on='id', how='left')\n",
"\n",
"new_cols = [i for i in df.columns if i not in pre_cols]\n",
"df[new_cols].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting topic distributions with NMF" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:16.295670Z", "start_time": "2021-04-06T09:41:16.271696Z" } }, "outputs": [], "source": [
"class nmf_list(object):\n",
"    def __init__(self, data, by_name, to_list, nmf_n, top_n):\n",
"        self.data = data\n",
"        self.by_name = by_name\n",
"        self.to_list = to_list\n",
"        self.nmf_n = nmf_n\n",
"        self.top_n = top_n\n",
"\n",
"    def run(self, tf_n):\n",
"        # Join each ship's tokens into one '|'-separated document.\n",
"        df_all = self.data.groupby(self.by_name)[self.to_list].apply(\n",
"            lambda x: '|'.join(x)).reset_index()\n",
"        self.data = df_all.copy()\n",
"\n",
"        print('build word_fre')\n",
"        # Word frequencies: for every token, [total count, document count].\n",
"        def word_fre(x):\n",
"            word_dict = []\n",
"            x = x.split('|')\n",
"            docs = []\n",
"            for doc in x:\n",
"                doc = doc.split()\n",
"                docs.append(doc)\n",
"                word_dict.extend(doc)\n",
"            word_dict = Counter(word_dict)\n",
"            new_word_dict = {}\n",
"            for key, value in word_dict.items():\n",
"                new_word_dict[key] = [value, 0]\n",
"            del word_dict\n",
"            del x\n",
"            for doc in docs:\n",
"                doc = Counter(doc)\n",
"                for word in doc.keys():\n",
"                    new_word_dict[word][1] += 1\n",
"            return new_word_dict\n",
"        self.data['word_fre'] = self.data[self.to_list].apply(word_fre)\n",
"\n",
"        print('build top_' + str(self.top_n))\n",
"        # Keep only the top_n highest-frequency words per ship.\n",
"        def top_100(word_dict):\n",
"            return sorted(word_dict.items(),\n",
"                          key=lambda x: (x[1][1], x[1][0]),\n",
"                          reverse=True)[:self.top_n]\n",
"        self.data['top_' + str(self.top_n)] = self.data['word_fre'].apply(top_100)\n",
"\n",
"        def top_100_word(word_list):\n",
"            words = []\n",
"            for i in word_list:\n",
"                i = list(i)\n",
"                words.append(i[0])\n",
"            return words\n",
"        self.data['top_' + str(self.top_n) + '_word'] = \\\n",
"            self.data['top_' + str(self.top_n)].apply(top_100_word)\n",
"        print(self.data.shape)\n",
"\n",
"        # Words appearing in the top list of more than half of the ships\n",
"        # carry little discriminative information: treat them as stop words.\n",
"        word_list = []\n",
"        for i in self.data['top_' + str(self.top_n) + '_word'].values:\n",
"            word_list.extend(i)\n",
"        word_list = Counter(word_list)\n",
"        word_list = sorted(word_list.items(), key=lambda x: x[1], reverse=True)\n",
"        user_fre = []\n",
"        for i in word_list:\n",
"            i = list(i)\n",
"            user_fre.append(i[1] / self.data[self.by_name].nunique())\n",
"        stop_words = []\n",
"        for i, j in zip(word_list, user_fre):\n",
"            if j > 0.5:\n",
"                i = list(i)\n",
"                stop_words.append(i[0])\n",
"\n",
"        print('start title_feature')\n",
"        # Treat each ship's joined token list as one text document.\n",
"        self.data['title_feature'] = self.data[self.to_list].apply(lambda x: x.split('|'))\n",
"        self.data['title_feature'] = self.data['title_feature'].apply(\n",
"            lambda line: [w for w in line if w not in stop_words])\n",
"        self.data['title_feature'] = self.data['title_feature'].apply(lambda x: ' '.join(x))\n",
"\n",
"        print('start NMF')\n",
"        # TF-IDF on tf_n-grams, then NMF to extract the topic distribution.\n",
"        tfidf_vectorizer = TfidfVectorizer(ngram_range=(tf_n, tf_n))\n",
"        tfidf = tfidf_vectorizer.fit_transform(self.data['title_feature'].values)\n",
"        text_nmf = NMF(n_components=self.nmf_n).fit_transform(tfidf)\n",
"\n",
"        # Collect the topic columns and attach the ship id.\n",
"        name = [str(tf_n) + self.to_list + '_' + str(x)\n",
"                for x in range(1, self.nmf_n + 1)]\n",
"        tag_list = pd.DataFrame(text_nmf)\n",
"        print(tag_list.shape)\n",
"        tag_list.columns = name\n",
"        tag_list[self.by_name] = self.data[self.by_name]\n",
"        column_name = [self.by_name] + name\n",
"        tag_list = tag_list[column_name]\n",
"        return tag_list" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:17.109358Z", "start_time": "2021-04-06T09:41:16.763209Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
"********* 1 *******\n", "build word_fre\n", "build top_2\n", "(6, 5)\n", "start title_feature\n", "start NMF\n", "(6, 8)\n",
"build word_fre\n", "build top_2\n", "(6, 5)\n", "start title_feature\n", "start NMF\n", "(6, 8)\n",
"build word_fre\n", "build top_2\n", "(6, 5)\n", "start title_feature\n", "start NMF\n", "(6, 8)\n",
"********* 2 *******\n", "build word_fre\n", "build top_2\n", "(6, 5)\n", "start title_feature\n", "start NMF\n", "(6, 8)\n",
"build word_fre\n", "build top_2\n", "(6, 5)\n", "start title_feature\n", "start NMF\n", "(6, 8)\n",
"build word_fre\n", "build top_2\n", "(6, 5)\n", "start title_feature\n", "start NMF\n", "(6, 8)\n",
"********* 3 *******\n", "build word_fre\n", "build top_2\n", "(6, 5)\n", "start title_feature\n", "start NMF\n", "(6, 8)\n",
"build word_fre\n", "build top_2\n", "(6, 5)\n", "start title_feature\n", "start NMF\n", "(6, 8)\n",
"build word_fre\n", "build top_2\n", "(6, 5)\n", "start title_feature\n", "start NMF\n", "(6, 8)\n" ] } ], "source": [ "data = df.copy()\n",
"data.rename(columns={'v':'speed','id':'ship'},inplace=True)\n", "for j in range(1,4):\n", " print('********* {} *******'.format(j))\n", " for i in ['speed','x','y']:\n", " data[i + '_str'] = data[i].astype(str)\n", " nmf = nmf_list(data,'ship',i + '_str',8,2)\n", " nmf_a = nmf.run(j)\n", " nmf_a.rename(columns={'ship':'id'},inplace=True)\n", " data_label = data_label.merge(nmf_a,on = 'id',how = 'left')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:17.543827Z", "start_time": "2021-04-06T09:41:17.473051Z" } }, "outputs": [ { "data": { "text/html": [ "
{ "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2021-04-06T09:41:17.543827Z", "start_time": "2021-04-06T09:41:17.473051Z" } }, "outputs": [ { "data": {
" ], "text/plain": [ " 1speed_str_1 1speed_str_2 1speed_str_3 1speed_str_4 1speed_str_5 \\\n", "0 0.0 0.0 0.014368 0.0 0.009987 \n", "1 0.0 0.0 0.014368 0.0 0.009987 \n", "2 0.0 0.0 0.014368 0.0 0.009987 \n", "3 0.0 0.0 0.014368 0.0 0.009987 \n", "4 0.0 0.0 0.014368 0.0 0.009987 \n", "\n", " 1speed_str_6 1speed_str_7 1speed_str_8 1x_str_1 1x_str_2 ... \\\n", "0 0.313981 0.0 0.104036 0.0 0.0 ... \n", "1 0.313981 0.0 0.104036 0.0 0.0 ... \n", "2 0.313981 0.0 0.104036 0.0 0.0 ... \n", "3 0.313981 0.0 0.104036 0.0 0.0 ... \n", "4 0.313981 0.0 0.104036 0.0 0.0 ... \n", "\n", " 3x_str_7 3x_str_8 3y_str_1 3y_str_2 3y_str_3 3y_str_4 3y_str_5 \\\n", "0 0.0 0.12743 0.0 0.0 0.0 0.091 0.0 \n", "1 0.0 0.12743 0.0 0.0 0.0 0.091 0.0 \n", "2 0.0 0.12743 0.0 0.0 0.0 0.091 0.0 \n", "3 0.0 0.12743 0.0 0.0 0.0 0.091 0.0 \n", "4 0.0 0.12743 0.0 0.0 0.0 0.091 0.0 \n", "\n", " 3y_str_6 3y_str_7 3y_str_8 \n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", "[5 rows x 72 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_cols = [i for i in data_label.columns if i not in df.columns]\n", "df = df.merge(data_label[new_cols+['id']],on='id',how='left')\n", "\n", "df[new_cols].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 总结与思考" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- 赛题特征工程:该如何构建有效果的赛题特征工程\n", " \n", " 参考:通过数据EDA、查阅对应赛题的参考文献,寻找并构建有实际意义的业务特征\n", "\n", "\n", "- 分箱特征:几乎所有topline代码中均有分箱特征的构造,为何分箱特征如此重要且有效。在什么情况下使用分箱特征的效果好?(为什么本赛题需要分箱特征)\n", " \n", " 参考:分箱的原理\n", "\n", "- DataFrame特征:针对pandas DataFrame的内置方法的使用,可以构造出大量的统计特征。建议:自行整理一份针对表格数据的统计特征构造函数\n", " \n", " 参考:DataWhale的joyful pandas\n", "\n", "\n", "- Embedding特征:上分秘籍,将序列转换成NLP文本中的一句话或一篇文章进行特征向量化为何效果如此之好。如何针对给定数据,调整参数构造较好的词向量?\n", " \n", " 参考:Word2vec的学习" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 附录\n", "\n", "## 学习来源\n", "1 团队名称:Pursuing the Past Youth\n", "链接:\n", "https://github.com/juzstu/TianChi_HaiYang\n", "\n", "2 团队名称:liu123的航空母舰队\n", "链接:\n", "https://github.com/MichaelYin1994/tianchi-trajectory-data-mining\n", "\n", "3 团队名称:天才海神号\n", "链接:\n", "https://github.com/fengdu78/tianchi_haiyang?spm=5176.12282029.0.0.5b97301792pLch\n", "\n", "4 团队名称:大白\n", "链接:\n", "https://github.com/Ai-Light/2020-zhihuihaiyang\n", "\n", "5 团队名称:抗毒救灾\n", "链接:\n", "https://github.com/wudejian789/2020DCIC_A_Rank7_B_Rank12\n", "\n", "6 团队名称:蜗牛坐车里团队\n", "链接:\n", "https://tianchi.aliyun.com/notebook-ai/detail?postId=114808\n", "\n", "7 团队名称:用欧气驱散疫情\n", "链接:\n", "https://github.com/tudoulei/2020-Digital-China-Innovation-Competition\n", "\n", "## 数据\n", "所用数据是 hy_round1_train_20200102(初赛数据)\n", "\n", "## 运行过程\n", "针对各团队的整理的详细运行代码见 ipynb/*.ipynb\n", "数字序号与上面相同\n", "\n", "## 运行结果\n", "文件输出见 result/*.csv\n", "\n", "## 部分解释\n", "\n", "- 【天池智慧海洋建设】Topline源码——特征工程学习(大白):\n", "https://blog.csdn.net/qq_44574333/article/details/115188086\n", "s\n", "- 【天池智慧海洋建设】Topline源码——特征工程学习(Pursuing the Past Youth):\n", "https://blog.csdn.net/qq_44574333/article/details/112547081\n", "\n", "- 【天池智慧海洋建设】Topline源码——特征工程学习(天才海神号):\n", "https://blog.csdn.net/qq_44574333/article/details/115185634\n", "\n", "- 【天池智慧海洋建设】Topline源码——特征工程学习(liu123的航空母舰队):\n", "https://blog.csdn.net/qq_44574333/article/details/115091764\n", "\n", "## 推荐的学习资料\n", "实战类:知名比赛的topline代码,如kaggle、天池等平台的开源代码\n", "\n", "书籍类: \n", " \n", " +《阿里云天池大赛赛题解析》\n", " \n", " 【笔者也有博客笔记学习(https://blog.csdn.net/qq_44574333/article/details/109611764)】\n", " \n", " +《美团机器学习实战》\n", 
" \n", "\n", "教程类:\n", "\n", " + Joyful Pandas 强烈推荐!基础且高效\n", " http://joyfulpandas.datawhale.club/" ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "580px", "left": "53px", "top": "143px", "width": "307.2px" }, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }